The purpose of this model is to predict the outcome of a standard League of Legends (LoL) ranked match based solely on the first ten minutes of the match. I will be using this dataset from Kaggle, by Yi Lan Ma, to do so.
League of Legends is a 5v5, player-vs-player MOBA game. If you don’t know what that means…great! It doesn’t matter for the purposes of this model! At its core, LoL is a game with two teams (Blue and Red) of 5, with each player piloting a unique champion, battling it out to capture the enemy’s base (aka destroy the Nexus).
(Visual 1.1)
So what are the things a team needs in order to win?
Objectives are structures (towers, inhibitors, and the Nexus) and monsters (dragons and the Herald) that a team must take in order to progress towards victory.
(Visual: the red team’s Nexus)
There are 3 paths/lanes a team can take towards the opponent’s base. However, each lane consists of 2 turrets, an inhibitor turret, and an inhibitor preventing a team’s advance towards the opponent’s base; the Nexus itself also has 2 turrets for self-defense (refer to Visual 1.1). Long story short, the more Objectives a team captures, the closer they are to the Nexus. If the enemy Nexus is destroyed, then the team that destroyed it wins the match. Easy, right? Oh, no. The game is a lot more complicated than that.
Introducing “Neutral Objectives.” Unlike Objectives, Neutral Objectives are things that both teams can contest in order to get closer to victory. There are three Neutral Objectives, but for our purposes, we will only be taking a look at 2 of them: the Rift Herald and the Dragon.
A game with just objectives would not be much fun, though. What about the enemies? What determines how powerful a team is when compared to the enemy team? This is where gold comes in!
(Visual: the in-game shop)
Gold allows players to purchase items for their champions, effectively making their champions stronger. Gold is obtained through many sources: passive income over time, killing minions (CS), slaying monsters, and taking down enemy champions and structures.
Basically, pretty much anything you do in LoL grants you gold.
Experience goes hand in hand with gold; it measures the level of a player’s champion. The higher level a champion is, the more abilities they unlock and the stronger those abilities become. Experience is gained through similar sources: killing minions, monsters, and enemy champions.
(Visual: sources of experience)
Basically, whenever you kill something, you gain experience.
So a team’s gold, experience, and objectives are, together, a measure of that team’s success during a match of League of Legends.
The gaming community generally agrees on the importance of gold, experience, and objectives in a game of League of Legends; however, the debate about which of those should be prioritized is never-ending. Does having more gold lead to a team’s win, or is leveling up more important? Or should a team ignore their champions’ strength and simply go for the objectives? By answering this question, we should, theoretically, be able to win more games.
# Load Packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.1 ✔ rsample 1.1.1
## ✔ dials 1.1.0 ✔ tune 1.0.1
## ✔ infer 1.0.4 ✔ workflows 1.1.2
## ✔ modeldata 1.0.1 ✔ workflowsets 1.0.0
## ✔ parsnip 1.0.3 ✔ yardstick 1.1.0
## ✔ recipes 1.0.4
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(kknn)
library(ggplot2)
library(corrr)
library(corrplot)
## corrplot 0.92 loaded
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(rcompanion)
##
## Attaching package: 'rcompanion'
##
## The following object is masked from 'package:yardstick':
##
## accuracy
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(dplyr)
# Assigning the data to a variable
raw_df <- read_csv("raw_data/high_diamond_ranked_10min.csv", show_col_types = FALSE)
Looking at the raw data, we can see that there are many variables that are just additive inverses of an already existing variable. This makes logical sense: if a team is ahead by 300 gold, the other team is obviously behind by 300 gold.
Let’s take a closer look at these correlations:
corr_simple <- function(data = raw_df, sig = 0.9){
  # convert data to numeric in order to run correlations
  # convert to factor first to keep the integrity of the data - each value will become a number rather than turn into NA
  df_cor <- data %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  # run a correlation and drop the insignificant ones
  corr <- cor(df_cor)
  # prepare to drop duplicates and correlations of 1
  corr[lower.tri(corr, diag = TRUE)] <- NA
  # turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  # remove the NA values from above
  corr <- na.omit(corr)
  # select significant values
  corr <- subset(corr, abs(Freq) > sig)
  # sort by highest correlation
  corr <- corr[order(-abs(corr$Freq)), ]
  # print table
  print(corr)
}
corr_simple()
## Var1 Var2 Freq
## 776 blueTotalMinionsKilled blueCSPerMin 1.0000000
## 813 blueTotalGold blueGoldPerMin 1.0000000
## 925 blueFirstBlood redFirstBlood -1.0000000
## 967 blueDeaths redKills 1.0000000
## 1458 blueGoldDiff redGoldDiff -1.0000000
## 1499 blueExperienceDiff redExperienceDiff -1.0000000
## 1555 redTotalMinionsKilled redCSPerMin 1.0000000
## 1592 redTotalGold redGoldPerMin 1.0000000
## 1006 blueKills redDeaths 1.0000000
## 1353 redAvgLevel redTotalExperience 0.9017484
## 574 blueAvgLevel blueTotalExperience 0.9012968
As one can see, there are quite a few variables that are just additive inverses of another variable (a correlation of -1), and there are also variables that are perfectly correlated (a correlation of 1). I will go ahead and drop these variables, as they add no new information that their correlated counterparts do not provide. There is also a heavily correlated pair on both teams: AvgLevel and TotalExperience. I will not drop them here, as they are not perfectly correlated; there might be information that one provides that the other does not. That is not to say they will never get dropped from the data after further testing. I will also be dropping gameId, as it is irrelevant for our purposes.
# Drop duplicate variables (perfectly correlated variables)
reduced_df <- subset(raw_df, select = -c(gameId, redFirstBlood, redKills, redDeaths, redGoldDiff, redExperienceDiff, blueTotalGold, blueTotalMinionsKilled, redTotalMinionsKilled, redTotalGold))
# Convert categorical variables into factors
reduced_df <- reduced_df %>%
  mutate(across(c(blueWins, blueFirstBlood, blueDragons, blueHeralds,
                  redDragons, redHeralds,
                  blueTowersDestroyed, redTowersDestroyed),
                as.factor))
reduced_df %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type = "lower", diag = FALSE)
Our data is already looking much better; the duplicate variables are all gone! There are still heavily correlated variables, but we’ll take a closer look at them when we fit our model.
Since I got the data from Kaggle, there most likely will not be any missing data. But just in case, let’s double-check:
sum(is.na(reduced_df))
## [1] 0
The sum of missing values is 0; there is no missing data to deal with.
write.csv(reduced_df, "./clean_data/cleaned_data.csv", row.names = FALSE)
Now that our data is clean and ready to work with, let’s get to know our variables! The following are the variables we will be working with in our machine learning process (there is a more detailed codebook in the clean_data directory):
blueWins: Did the team win or lose (1 or 0)
blueWardsPlaced: The team’s number of wards placed
blueWardsDestroyed: The team’s number of wards destroyed
blueFirstBlood: Did the team get the first kill of the match?
blueKills: The team’s number of kills
blueDeaths: The team’s number of deaths
blueAssists: The team’s number of assists
blueEliteMonsters: The number of Elite Monsters that the team took down
blueDragons: The total number of dragons that the team took down
blueHeralds: The total number of Heralds that the team took down
blueTowersDestroyed: The total number of towers that the team took down
blueAvgLevel: The team’s average level across all five players
blueTotalExperience: The total experience gained by all five players
blueTotalJungleMinionsKilled: The total number of jungle monsters killed by all five players
blueGoldDiff: The gold difference between the two teams
blueExperienceDiff: The experience difference between the two teams
blueGoldPerMin: The team’s gold income per minute
blueCSPerMin: The team’s minions killed per minute
redWardsPlaced: The enemy team’s number of wards placed
redWardsDestroyed: The enemy team’s number of wards destroyed
redAssists: The enemy team’s number of assists
redEliteMonsters: The number of Elite Monsters that the enemy team took down
redDragons: The total number of dragons that the enemy team took down
redHeralds: The total number of Heralds that the enemy team took down
redTowersDestroyed: The total number of towers that the enemy team took down
redAvgLevel: The enemy team’s average level across all five players
redTotalExperience: The enemy team’s total experience gained by all five players
redTotalJungleMinionsKilled: The total number of jungle monsters killed by the enemy team
redCSPerMin: The enemy team’s minions killed per minute
redGoldPerMin: The enemy team’s gold income per minute
Upon first glance, there are a lot of variables; however, if we think back to when I introduced gold, experience, and objectives, it’s apparent that most of these variables contribute to those three factors.
As a reference, let’s visualize a few of these relationships.
ggplot(reduced_df, aes(x = blueKills,
y = blueGoldPerMin)) +
geom_point(alpha=0.25) +
labs(title = "Relationship between Gold and kills")
Haha, if this doesn’t make it clear that more kills lead to more gold, I don’t know what will. What about kills and experience?
ggplot(reduced_df, aes(x = blueKills,
y = blueTotalExperience)) +
geom_point(alpha=0.25) +
labs(title = "Relationship between Experience and kills")
Another positive linear relationship! So yes, our data really can be summed up by gold, experience, and objectives; however, this also means that there are many highly correlated variables, as we saw in the correlation plot, and we will keep this in mind.
With that being said, everything in our data can be summed up by gold, experience, and objectives. Let’s see how each of these major factors affects a team’s ability to win, one by one.
One of the most important factors that determines a team’s ability to win is how many objectives they have. Below are percent stacked bar charts for two major objectives in a match of League of Legends: dragons and towers.
ggplot(reduced_df, aes(fill=blueDragons, x=blueWins)) +
geom_bar(position="fill")
ggplot(reduced_df, aes(fill=blueTowersDestroyed, x=blueWins)) +
geom_bar(position="fill")
As one can see, when the blue team wins, relatively more objectives were captured in the first 10 minutes compared to the matches where blue loses. However, there is a considerable amount of data showing a team winning without capturing a single objective in the first 10 minutes. When this happens, the team is most likely winning through the lead it has acquired in gold and experience from CS and kills. Let’s take a look at those next:
ggplot(reduced_df, aes(x = blueTotalExperience,
y = blueGoldPerMin,
color=blueWins)) +
geom_point(alpha=0.25) +
labs(title = "Relationship between Gold, Exp, and Win")
The graph above shows that the more gold and experience a team has, the more likely they are to win the match. This makes a lot of sense; gold, experience, and objectives are the core of winning a match of League of Legends.
So this supports our theory that the outcome of a match of League of Legends is highly related to the gold, experience, and objectives that a team acquires. Now for the controversial question: which of those three is most important?
There is only so much a team can do in the first 10 minutes of a match; it would be much better if we know what to focus on for a higher chance of winning.
In order to find out more about our data, we will now fit a few machine learning models to see what they have to say about what’s important when trying to win a match of League of Legends.
Let’s set up our data for machine learning by splitting it into training and testing sets, creating a recipe, and setting up cross-validation.
Before we perform a split, there is something that we must address: we must stratify our data on blueWins in order to preserve the win-lose ratio present in our raw data.
However, something interesting to note is that it is often argued that the blue team has a higher chance of winning because of the ‘pick/ban phase.’ We will not get into the nitty-gritty of pick/ban and will simply take for granted that blue side may have a higher win percentage. In fact, let’s see if this trend exists within our data:
count(reduced_df, blueWins)
## # A tibble: 2 × 2
## blueWins n
## <fct> <int>
## 1 0 4949
## 2 1 4930
Hmm, it seems that our data is actually quite balanced in terms of win-lose ratio; however, we must still stratify our split by blueWins in order to keep this 50-50 ratio across our training and testing sets.
I will go ahead and perform the initial split of the data into 80-20 train-test, stratifying on blueWins.
set.seed(10502)
df_split <- initial_split(reduced_df, prop = 0.80, strata = blueWins)
df_train <- training(df_split)
df_test <- testing(df_split)
Now let’s formulate our recipe! Remember the problem we had with high correlation between our predictors? Here in our recipe, we will use step_zv() to remove any zero-variance predictors that could cause trouble when fitting. Furthermore, although we are interested in gold, experience, and objectives, we will still fit all of our remaining predictors, as there may be certain factors not accounted for by gold, experience, and objectives, such as wardsPlaced and wardsDestroyed.
# Recipe
LoL_recipe <- recipe(blueWins ~ ., data = df_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
# Prep & Bake
LoL_recipe %>%
prep() %>%
bake(new_data = df_train)
## # A tibble: 7,903 × 34
## blueWardsPl…¹ blueW…² blueK…³ blueD…⁴ blueA…⁵ blueE…⁶ blueA…⁷ blueT…⁸ blueT…⁹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.325 -0.382 0.924 -0.0536 1.06 -0.877 -1.04 -0.743 -1.47
## 2 -0.574 -0.841 -0.400 -0.394 -0.410 -0.877 -1.04 -1.39 -0.757
## 3 -0.405 -1.30 0.262 1.65 -0.654 0.722 -1.69 -1.43 -0.454
## 4 1.17 -0.841 -0.731 -0.394 -0.410 0.722 0.274 0.0202 0.458
## 5 2.96 0.535 -0.0689 -0.0536 -0.165 -0.877 0.274 0.512 0.660
## 6 -0.349 -0.382 -0.400 2.33 -0.899 -0.877 -1.69 -1.50 -0.251
## 7 -0.349 0.0763 0.262 0.287 0.324 -0.877 0.930 0.499 1.07
## 8 -0.124 0.0763 -0.731 -0.734 -0.165 -0.877 -1.04 -1.10 -2.28
## 9 0.606 -0.382 1.59 1.65 0.0793 0.722 0.274 0.482 -1.06
## 10 1.17 0.0763 -1.06 0.287 -0.899 0.722 -0.382 -0.869 0.357
## # … with 7,893 more rows, 25 more variables: blueGoldDiff <dbl>,
## # blueExperienceDiff <dbl>, blueCSPerMin <dbl>, blueGoldPerMin <dbl>,
## # redWardsPlaced <dbl>, redWardsDestroyed <dbl>, redAssists <dbl>,
## # redEliteMonsters <dbl>, redAvgLevel <dbl>, redTotalExperience <dbl>,
## # redTotalJungleMinionsKilled <dbl>, redCSPerMin <dbl>, redGoldPerMin <dbl>,
## # blueWins <fct>, blueFirstBlood_X1 <dbl>, blueDragons_X1 <dbl>,
## # blueHeralds_X1 <dbl>, blueTowersDestroyed_X1 <dbl>, …
Now that our data has been split into training and testing data and a recipe has been made, let’s decide on a cross-validation method for our training data. The reason we need cross-validation is that we need to see how accurate our model is before testing it with the test data. To do this, we will further split our training data into training and validation data; we will train a model on the training data and assess it using the validation data before running it with the test data. This will prevent our model from overfitting to the training set.
df_folds <- vfold_cv(df_train, v = 5, strata = blueWins)
Here, we will be using 5-fold cross-validation, since we only have around 8,000 observations in our training data. This way, each fold’s validation set will hold around 1,580 observations, while each model still trains on roughly 6,300; a pretty good balance.
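That arithmetic is easy to sanity-check with a stand-in data frame of the same size (the rows themselves are dummies; only the row count and outcome balance matter here):

```r
library(rsample)
set.seed(10502)

# Stand-in for our training set: same row count, balanced outcome
stand_in <- data.frame(blueWins = factor(rep(0:1, length.out = 7903)))
folds    <- vfold_cv(stand_in, v = 5, strata = blueWins)

# Each assessment (validation) piece holds roughly 7903 / 5 ~ 1580 rows,
# while each analysis (training) piece keeps the remaining ~6320
sapply(folds$splits, function(s) nrow(assessment(s)))
```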
Saving our folds, recipe, and train/test split for our machine learning scripts.
save(df_folds, LoL_recipe, df_train, df_test, file=
"./RDAfiles/splitrecipefold.rda")
Now that we have our folds, recipe, and split, we are ready to start feeding our data into our models to begin the tuning process.
The models will be tuned in a separate R script; however, the basic tuning steps are the same for every model: specify the model with its hyperparameters marked for tuning, combine it with our recipe in a workflow, choose a grid of candidate hyperparameter values, and evaluate each candidate across our cross-validation folds.
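As a concrete sketch of that process, here is how the K-Nearest Neighbors model could be tuned (a minimal sketch: it assumes the LoL_recipe and df_folds objects created earlier, and the grid range shown is illustrative rather than the one actually used in the tuning script):

```r
library(tidymodels)

# 1. Model specification with the hyperparameter marked for tuning
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")

# 2. Workflow combining the specification with our recipe
knn_wf <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(LoL_recipe)

# 3. Grid of candidate hyperparameter values (illustrative range)
knn_grid <- grid_regular(neighbors(range = c(1, 15)), levels = 15)

# 4. Tune across the cross-validation folds, scoring by roc_auc
knn_res <- tune_grid(knn_wf,
                     resamples = df_folds,
                     grid      = knn_grid,
                     metrics   = metric_set(roc_auc))

# 5. Save the results for later comparison
save(knn_res, file = "./RDAfiles/knn_res.rda")
```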
By following the steps above, I went ahead and tuned 6 different models for our purposes. The models are Elastic Net, K-Nearest Neighbors, Logistic Regression, Random Forest, Boosted Tree, and Support Vector Machine. We will go ahead and load in the results now:
load('./RDAfiles/glm_res.rda')
load('./RDAfiles/knn_res.rda')
load('./RDAfiles/log_res.rda')
load('./RDAfiles/rf_res.rda')
load('./RDAfiles/bt_res.rda')
load('./RDAfiles/svm_res.rda')
Now that we have our tuning results, how do we know which model and which hyperparameters to use? We will evaluate the performance of our models by their roc_auc values: the area under the ROC curve.
An ROC curve is a graph showing the performance of a classification model at all classification thresholds; it plots the true positive rate (sensitivity) against the false positive rate (1 - specificity).
(Visual 1.2)
The Area Under the ROC Curve (AUC) provides an aggregate measure of performance across all possible classification thresholds; it is the probability that the model ranks a random positive example more highly than a random negative example. In other words, the higher the AUC value, the better the model performs.
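To make that ranking interpretation concrete, here is a tiny hand-made example (the six matches and their probabilities are invented for illustration) showing how yardstick computes the AUC:

```r
library(yardstick)
library(tibble)

# Six hypothetical matches: truth is the actual outcome, .pred_0 is the
# model's predicted probability of class 0
toy <- tibble(
  truth   = factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1)),
  .pred_0 = c(0.90, 0.80, 0.40, 0.45, 0.20, 0.10)
)

# Of the 3 x 3 = 9 (class 0, class 1) pairs, 8 are ranked correctly
# (only the 0.40 vs 0.45 pair is inverted), so the AUC is 8/9 ~ 0.889
roc_auc(toy, truth = truth, .pred_0)
```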
We want to see which models and hyperparameters had the largest roc_auc values across all of our folds. Below is a table showing the best-performing tuned hyperparameters for each of our models according to their roc_auc values:
rf_best <- show_best(tune_rf, metric = 'roc_auc', n = 1)
bt_best <- show_best(tune_bt, metric = 'roc_auc', n = 1)
log_best <- show_best(log_res, metric = 'roc_auc', n = 1)
knn_best <- show_best(knn_res, metric = 'roc_auc', n = 1)
glm_best <- show_best(glm_res, metric = 'roc_auc', n = 1)
svm_best <- show_best(svm_res, metric = 'roc_auc', n = 1)
comparison_tb <- bind_rows(rf_best, bt_best, log_best, knn_best, glm_best, svm_best)
models <- c('random forest', 'boosted tree', 'logistical','knn','elastic net', 'support vector machine')
comparison_tb <- cbind(models, comparison_tb)
comparison_tb<- comparison_tb[order(comparison_tb$mean, decreasing = TRUE),]
comparison_tb
## models mtry trees min_n .metric .estimator mean n
## 5 elastic net NA NA NA roc_auc binary 0.8102488 5
## 6 support vector machine NA NA NA roc_auc binary 0.8102488 5
## 3 logistical NA NA NA roc_auc binary 0.8101318 5
## 2 boosted tree 8 100 NA roc_auc binary 0.8082945 5
## 1 random forest 1 200 75 roc_auc binary 0.8075759 5
## 4 knn NA NA NA roc_auc binary 0.7464552 5
## std_err .config learn_rate neighbors penalty mixture
## 5 0.003540821 Preprocessor1_Model091 NA NA 0 1
## 6 0.003540821 Preprocessor1_Model091 NA NA 0 1
## 3 0.003706796 Preprocessor1_Model1 NA NA NA NA
## 2 0.007577027 Preprocessor1_Model036 3.162278e-06 NA NA NA
## 1 0.003542156 Preprocessor1_Model061 NA NA NA NA
## 4 0.004316331 Preprocessor1_Model10 NA 10 NA NA
Looking at our models, they all performed around the same, with knn being the only model scoring below 0.80 on mean roc_auc. We will take a closer look at the best performing model, the elastic net (mean roc_auc = 0.810), and, just for fun, the boosted tree model (mean roc_auc = 0.808).
Now that we know which models we will use to answer our questions about gold, experience, and objectives in League of Legends; let’s begin fitting our models to our training data.
However, before we begin fitting the models to the training data, let’s take a closer look at the parameters our tuning step chose for our models. Starting with the Elastic Net model:
autoplot(glm_res)
It looks like the Elastic Net model prefers a smaller penalty when the mixture is large and a greater penalty when the mixture is small; however, the most optimal combination ended up being a penalty of 0 and a mixture of 1. A mixture of 1 corresponds to a pure LASSO model, although with a penalty of 0 the regularization term effectively vanishes, which is consistent with the plain logistic regression scoring almost identically.
What about our boosted-tree?
autoplot(tune_bt)
It looks like the number of trees in our model does not matter much, as it has no significant effect on our roc_auc. The number of randomly selected predictors (mtry) seems to affect the roc_auc slightly, as an increase in mtry seems to cause a decrease in roc_auc. Finally, the learning rate seems to be most optimal around the middle of our tuning range.
In the end, our most optimal boosted tree has mtry = 8, trees = 100, and learn_rate = 3.162278e-06.
Now we will fit our models to the entire training set using the most optimal parameters found during tuning. Luckily, R makes this quite easy, as we only need to give it our hyperparameters, workflow, and the training data.
Elastic Net:
# Fitting an elastic net to the training set
best_para_glm <- select_best(glm_res, metric = 'roc_auc')
glm_mod <- logistic_reg(mixture = tune(), penalty = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
glm_wf <- workflow() %>%
add_model(glm_mod) %>%
add_recipe(LoL_recipe)
glm_final <- finalize_workflow(glm_wf, best_para_glm)
glm_final_fit <- fit(glm_final, data = df_train)
Boosted Tree:
# Fitting a boosted tree to the training set
bt_spec <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
bt_wf <- workflow() %>%
add_model(bt_spec) %>%
add_recipe(LoL_recipe)
best_complexity <- select_best(tune_bt, metric = 'roc_auc')
bt_final <- finalize_workflow(bt_wf, best_complexity)
bt_final_fit <- fit(bt_final, data = df_train)
Now we have two models fitted on our training data: an elastic net model with a mean roc_auc value of 0.810, and a boosted tree model with a mean roc_auc value of 0.808. However, how will these numbers hold up on our testing set?
On data never seen before by our models, will they continue to perform with high roc_auc values, or will we find flaws such as overfitting?
Let’s look at them individually.
First, let’s take a look at the boosted tree model.
(Visual 1.3)
A boosted tree, or gradient boosting, combines weak “learners” into a single strong learner in an iterative fashion. The model starts off by fitting a single decision tree and evaluating how well that tree performs with a loss function. We then attempt to make the model more accurate by lowering the loss; to do so, we follow the gradient descent of the loss function.
(Visual 1.4)
This is where the learning rate comes in; learn_rate limits how far we step down the gradient descent curve, preventing our model from overshooting. For our fit, the most optimal learn_rate is 3.162278e-06, along with our other hyperparameters: mtry = 8 (predictors considered at each split) and trees = 100 (number of trees to fit).
Now that we understand boosted trees a little more, let’s see how the model performs on our testing data:
final_bt_model_test <- augment(bt_final_fit, df_test) %>%
select(blueWins, starts_with(".pred"))
Here it is, the moment of truth. Now that we have applied our model to the testing data, let’s evaluate how it did. Once again, we will look at the ROC curve:
roc_curve(final_bt_model_test, truth = blueWins, .pred_0) %>%
autoplot()
Our ROC curve looks alright; it is pushing toward the top-left corner (roughly 75% of the way to a perfect square). This is good, as it suggests that our model is making the distinction between winning and losing a match of League of Legends. Let’s see the actual area under the ROC curve:
roc_auc(final_bt_model_test, truth = blueWins, .pred_0) %>%
select(.estimate)
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.805
The actual value of roc_auc is 0.805. Great, this means that our model ranks a random winning match above a random losing match about 80.5% of the time!
Now, I know 0.805 is not the best, but it is quite reliable for the purposes of this project. So can this boosted tree model tell us the most important factor when it comes to winning a game of League of Legends?
Let’s take a look at the model’s variable importance scores:
bt_final_fit %>% extract_fit_parsnip() %>%
vip() +
theme_minimal()
Clearly, gold is the most important factor, followed by experience. So, according to our boosted tree model, the more gold a team has, the more likely they are to take the win in a match of League of Legends.
Before we move on to the Elastic Net Model, let’s take a look at how many correct classifications our boosted tree actually made:
conf_mat(final_bt_model_test, truth = blueWins,
.pred_class) %>%
autoplot(type = "heatmap")
Our model correctly predicted 734 losses out of the 990 matches that actually ended in a Blue team loss, and it correctly predicted 709 wins out of the 986 matches that actually ended in a Blue team win. Not bad!
Now let’s move on to our best performing model: the Elastic Net.
The Elastic Net is a regularized regression method that combines the L1 and L2 penalties from the LASSO and Ridge regressions. The Elastic Net solves the ‘small n, large p’ problem that LASSO faces through the addition of the Ridge penalty.
\[ \hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^{2} + \lambda_{2}\|\beta\|^{2} + \lambda_{1}\|\beta\|_{1} \right) \]
A good way to think about the Elastic Net method is that it’s a flexible net that catches everything in between the LASSO and Ridge methods.
(Visual 1.5)
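In parsnip’s interface to glmnet, the mixture argument is exactly this dial between the two methods. A small sketch (the penalty values here are arbitrary, chosen purely for illustration):

```r
library(parsnip)
library(magrittr)

# mixture = 0 is pure Ridge, mixture = 1 is pure LASSO,
# and anything in between blends the L2 and L1 penalties
ridge_spec <- logistic_reg(penalty = 0.01, mixture = 0)   %>% set_engine("glmnet")
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1)   %>% set_engine("glmnet")
enet_spec  <- logistic_reg(penalty = 0.01, mixture = 0.5) %>% set_engine("glmnet")

translate(enet_spec)  # shows the underlying glmnet call parsnip will make
```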
For our model, though, the most optimal hyperparameters ended up being a penalty of 0 and a mixture of 1. This means that a pure LASSO model fits our data the best.
Alright, enough about the methodology; let’s apply the model to our testing data.
final_glm_model_test <- augment(glm_final_fit, df_test) %>%
select(blueWins, starts_with(".pred"))
Similar to what we did for the boosted tree model, we will take a look at the ROC curve for our Elastic Net model.
roc_curve(final_glm_model_test, truth = blueWins, .pred_0) %>%
autoplot()
Upon first glance, the ROC curve seems to be nearly identical to our ROC curve for the boosted tree; let’s compare their values directly.
roc_auc(final_glm_model_test, truth = blueWins, .pred_0) %>%
select(.estimate)
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.808
So the boosted tree and elastic net differ by only 0.003 roc_auc on our testing data; both models did alright. The roc_auc estimate for our elastic net model fell by only 0.002 on the testing set; there is very little overfitting, which means our 5-fold cross-validation did a great job. Yay!
Once again, let’s see what our elastic net model thinks is most important in a match of League of Legends: gold, experience, or objectives.
glm_final_fit %>% extract_fit_parsnip() %>%
vip() +
theme_minimal()
So, yes, gold is the most important factor when it comes to winning a match of League of Legends.
Last but not least, let’s take a look at how many wins and losses our model correctly predicted:
conf_mat(final_glm_model_test, truth = blueWins,
.pred_class) %>%
autoplot(type = "heatmap")
The accuracy of our elastic net model seems to be around the same as the boosted tree’s; interestingly, the elastic net seems to be slightly better at predicting wins than the boosted tree model.
In a game of League of Legends, there are many factors that affect a team’s ability to win; however, we narrowed the long list of factors down to three major ones: gold, experience, and objectives. Then, we used a few machine learning models to see if we could predict the outcome of a LoL match with the given factors. We were able to determine that the most important factor in deciding a match’s outcome is gold; more specifically, blueGoldDiff, the team’s lead in gold over the opposing team.
But by how much gold does a team need in the first ten minutes for a greater chance to win?
Below is a box plot for the Gold Difference against win-loss.
ggplot(reduced_df, aes(x = blueWins,
y = blueGoldDiff)) +
geom_boxplot() +
labs(title = "Relationship between Gold and win")
So it seems that a team with, on average, a 1,000-gold lead at the 10-minute mark is more likely to win. The reverse holds as well, which makes sense; a team that is 1,000 gold behind at the 10-minute mark is more likely to lose the match.
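We can also read those numbers straight off the data (this assumes the reduced_df data frame from earlier; the exact medians will depend on the data, so no specific values are claimed here):

```r
# Median 10-minute gold difference, split by match outcome
reduced_df %>%
  group_by(blueWins) %>%
  summarize(median_gold_diff = median(blueGoldDiff))
```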
League is a game that is constantly changing; a model built on an updated data set might be more accurate to today’s League of Legends. One of the biggest problems when fitting our models was the overwhelming amount of linear dependence among predictors; in the future, I would look for other methods of dealing with linear dependence rather than simply removing variables. One approach I have in mind is to use Principal Component Analysis to project the data onto its eigenvectors before fitting. Finally, a different model may fit this data better; however, considering the best model was an Elastic Net (LASSO), I am unsure whether a more complex model can do any better.
With that being said, thank you for reading this report. I hope it was somewhat insightful for League of Legends players and non-players alike.